Variable Selection, Trade-offs and Cross-Validation
Linear Model Selection
In the regression setting, the standard linear model, \[Y= \beta_0+\beta_1 X_1+...+\beta_p X_p+\epsilon\]
is commonly used to describe the relationship between a response \(Y\) and a set of variables \(X_1,\ldots,X_p\). One typically fits this model using least squares. In the chapters that follow, we consider some approaches for extending the linear model framework. It is possible to generalize this model to accommodate non-linear, but still additive, relationships, and even more general non-linear models. However, the linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive with non-linear methods. Hence, before moving to the non-linear world, we should discuss some ways in which the simple linear model can be improved by replacing plain least squares fitting with alternative fitting procedures. Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.
Prediction Accuracy: Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. Moreover, if \(n\), the number of observations, is much larger than \(p\), the number of variables, then the least squares estimates tend to also have low variance, and hence will perform well. However, if \(n\) is not much larger than \(p\), then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions. And if \(p>n\), then there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all. By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.
Model Interpretability: It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted. However, least squares is extremely unlikely to yield any coefficient estimates that are exactly zero. In this chapter, we see some approaches for automatically performing feature selection or variable selection, that is, for excluding irrelevant variables from a multiple regression model. There are, in general, three important classes of methods:
Subset Selection: This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
Shrinkage: This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.
Dimension Reduction: This approach involves projecting the p predictors into an M-dimensional subspace, where \(M<p\). This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
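As a minimal sketch of the dimension-reduction idea (using simulated data and base R's prcomp(); all variable names here are our own, not from any example in this chapter), we can regress a response on the first \(M=2\) principal-component scores:

```r
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)               # p = 5 original predictors
y <- drop(X %*% c(1, 2, 0, 0, 0)) + rnorm(n)  # response depends on X1, X2 only
pc <- prcomp(X, scale. = TRUE)                # principal components of X
Z <- pc$x[, 1:2]                              # keep M = 2 linear combinations
fit <- lm(y ~ Z)                              # least squares on the projections
length(coef(fit))                             # intercept plus M = 2 coefficients
```

The \(M\) projections replace the \(p\) original predictors, so the final least squares fit estimates only \(M+1\) coefficients.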
Choosing the Optimal Model
Best subset selection, forward selection, and backward selection result in the creation of a set of models, each of which contains a subset of the p predictors. In order to implement these methods, we need a way to determine which of these models is best. As we discussed before, the model containing all of the predictors will always have the smallest \(RSS\) and the largest \(R^2\), but these are not suitable for selecting the best model among a collection of models with different numbers of predictors. In order to select the best model, there are two common approaches:
We can indirectly estimate test error by making an adjustment to account for the bias due to overfitting.
We can directly estimate the test error, using either a validation set approach or a cross-validation approach.
We consider both of these approaches below.
Mallows’s \(C_p\)
For a fitted least squares model containing \(d\) predictors, the \(C_p\) estimate of test MSE is computed using the equation
\[ C_p=\frac{1}{n}\left(R S S+2 d \hat{\sigma}^2\right) \] where \(\hat{\sigma}^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement. Essentially, the \(C_p\) statistic adds a penalty of \(2 d \hat{\sigma}^2\) to the RSS in order to adjust for the fact that the training RSS tends to underestimate the test error. Clearly, the penalty increases as the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS. Though it is beyond the scope of this book, one can show that if \(\hat{\sigma}^2\) is an unbiased estimate of \(\sigma^2\), then \(C_p\) is an unbiased estimate of test MSE. As a consequence, the \(C_p\) statistic tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the lowest \(C_p\) value.
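As a quick sketch of the formula (on simulated data with hypothetical variable names, not the Hitters example), \(C_p\) can be computed directly from fitted lm objects, estimating \(\hat{\sigma}^2\) from the full model:

```r
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)          # x2 is irrelevant to the response
full <- lm(y ~ x1 + x2)
sigma2 <- summary(full)$sigma^2     # estimate of Var(epsilon)

# C_p = (RSS + 2 * d * sigma2_hat) / n for a model with d predictors
cp <- function(fit, d) (sum(resid(fit)^2) + 2 * d * sigma2) / n
cp(lm(y ~ x1), d = 1)   # smaller model: small penalty
cp(full, d = 2)         # larger model: pays the extra 2 * sigma2 / n
```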
Akaike Information Criterion (AIC)
The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of a linear model with Gaussian errors, maximum likelihood and least squares are the same thing. In this case the AIC is given by
\[ AIC=\frac{1}{n\hat\sigma^2}\left(R S S+2 d \hat{\sigma}^2\right) \] where, for simplicity, we have omitted an additive constant. Hence, for least squares models, \(C_p\) and AIC are proportional to each other.
Bayesian Information Criterion (BIC)
BIC is derived from a Bayesian point of view, but ends up looking similar to \(C_p\) and AIC as well. For the least squares model with \(d\) predictors, the BIC is, up to irrelevant constants, given by \[ \mathrm{BIC}=\frac{1}{n \hat{\sigma}^2}\left(R S S+\log (n) d \hat{\sigma}^2\right) \] Like \(C_p\), the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Since \(\log (n)>2\) for any \(n>7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\).
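Since \(C_p\)/AIC and BIC differ only in the per-predictor penalty (\(2\hat{\sigma}^2\) versus \(\log(n)\hat{\sigma}^2\)), the claim about heavier BIC penalties reduces to a one-line numerical check (a sketch, not tied to any data set):

```r
n <- 8:10000
# BIC's log(n) factor exceeds the factor of 2 used by C_p and AIC
all(log(n) > 2)   # TRUE for every n > 7
```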
We can also use the adjusted \(R^2\) alongside the measures above.
Validation and Cross-Validation
As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods. We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest. This procedure has an advantage relative to AIC, BIC, \(C_p\), and adjusted \(R^2\), in that it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model. It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance \(\sigma^2\).
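A minimal 5-fold cross-validation sketch in base R (simulated data; the helper name cv_mse is our own, not from a package), directly estimating test MSE for two candidate models:

```r
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
dat <- data.frame(y = 1 + 2 * x1 + rnorm(n), x1 = x1, x2 = x2)
folds <- sample(rep(1:5, length.out = n))   # random fold labels 1..5

cv_mse <- function(form) {
  mean(sapply(1:5, function(k) {
    fit <- lm(form, data = dat[folds != k, ])   # train on four folds
    pred <- predict(fit, newdata = dat[folds == k, ])
    mean((dat$y[folds == k] - pred)^2)          # error on the held-out fold
  }))
}
cv_mse(y ~ x1)        # the true model
cv_mse(y ~ x1 + x2)   # adds an irrelevant predictor
```

We would select whichever candidate has the smaller cross-validated MSE; note that no estimate of \(\sigma^2\) or of the model degrees of freedom is needed.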
Subset Selection
In this section we consider some methods for selecting subsets of predictors. These include best subset and stepwise model selection procedures.
1. Best Subset Selection
To perform best subset selection, we fit a separate least squares regression for each possible combination of the \(p\) predictors. That is, we fit all \(p\) models that contain exactly one predictor, \(\left(\begin{array}{l}p \\ 2\end{array}\right)=\frac{p(p-1)}{2}\) models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best. The problem of selecting the best model from among the \(2^p\) possibilities considered by best subset selection is not trivial. This is usually broken up into two stages, as described in the following algorithm:
Best Subset Selection Algorithm
1. Let \(M_0\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For \(k=1,2, \ldots, p\):
a). Fit all \(\left(\begin{array}{l}p \\ k\end{array}\right)\) models that contain exactly \(k\) predictors.
b). Pick the best among these \(\left(\begin{array}{l}p \\ k\end{array}\right)\) models, and call it \(M_k\). Here best is defined as having the smallest RSS, or equivalently the largest \(R^2\).
3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
Step 2 identifies the best model for each subset size, in order to reduce the problem from one of \(2^p\) possible models to one of \(p+1\) possible models. Now in order to select a single best model, we must simply choose among these \(p+1\) options. This task must be performed with care, because the RSS of these \(p+1\) models decreases monotonically, and the \(R^2\) increases monotonically, as the number of features included in the models increases. If we use these statistics to select the best model, then we will always end up with a model involving all of the variables. Therefore, in Step 3, we use cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\) in order to select among \(M_0, M_1, \ldots, M_p\). These approaches are discussed later.
Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression. In the case of logistic regression, instead of ordering models by RSS in Step 2 of the algorithm, we instead use the deviance, a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit. While best subset selection is a simple and conceptually appealing approach, it suffers from computational limitations. The number of possible models that must be considered grows rapidly as \(p\) increases. In general, there are \(2^p\) models that involve subsets of \(p\) predictors. So if \(p=10\), then there are approximately one thousand possible models to be considered, and if \(p=20\), then there are over one million possibilities! Consequently, best subset selection becomes computationally infeasible for values of \(p\) greater than around 40, even with extremely fast modern computers.
Example: Hitters Data Set
library(ISLR)
data("Hitters")
# First of all, we note that the Salary variable is missing for
# some of the players
sum(is.na(Hitters$Salary))
## [1] 59
# Remove the players with missing Salary; 263 observations remain
Hitters = na.omit(Hitters)
nrow(Hitters)
## [1] 263
The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS.
library(leaps)
regfit.full = regsubsets(Salary ~ ., Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
An asterisk indicates that a given variable is included in the corresponding model. By default, regsubsets() only reports results up to the best eight-variable model.
# the nvmax option can be used in order to return as many variables as are
#desired.
regfit.full=regsubsets(Salary~.,data=Hitters ,nvmax=19)
(reg.summary=summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
We can examine these components to try to select the best overall model. For instance, the \(R^2\) statistic increases monotonically as more variables are included:
reg.summary$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
which.min(reg.summary$rss)
## [1] 19
which.max(reg.summary$rsq)
## [1] 19
which.max(reg.summary$adjr2)
## [1] 11
which.min(reg.summary$cp)
## [1] 10
which.min(reg.summary$bic)
## [1] 6
Plot RSS, adjusted \(R^2\), \(C_p\), and BIC for all of the models at once:
par(mfrow=c(2,2)) # dividing the plot in to 4 parts
plot(reg.summary$rss ,xlab="Number of Variables ",ylab="RSS",
type="l")
points(which.min(reg.summary$rss),min(reg.summary$rss), col="red",cex=2,pch=20)
plot(reg.summary$adjr2 ,xlab="Number of Variables ",
ylab="Adjusted RSq",type="l")
points(which.max(reg.summary$adjr2),max(reg.summary$adjr2), col="red",
cex=2,pch=20)
plot(reg.summary$cp ,xlab="Number of Variables ",ylab="Cp", type="l")
points(which.min(reg.summary$cp ),min(reg.summary$cp),col="red",cex=2,pch=20)
plot(reg.summary$bic ,xlab="Number of Variables ",ylab="BIC",type="l")
points(which.min(reg.summary$bic),min(reg.summary$bic),col="red",cex=2,pch=20)
The regsubsets() function has a built-in plot() command
which can be used to display the selected variables for the best model
with a given number of predictors, ranked according to the BIC, \(Cp\), adjusted \(R^2\), or AIC.
To find out more about this function, type
?plot.regsubsets.
par(mfrow=c(2,2))
plot(regfit.full,scale="r2")
plot(regfit.full,scale="adjr2")
plot(regfit.full,scale="Cp")
plot(regfit.full,scale="bic")
The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic.
According to the lowest BIC, the best model contains six variables. We can see the coefficient estimates associated with this model:
coef(regfit.full, 6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
So the fitted model:
summary(lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters))
##
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CRBI + Division +
## PutOuts, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -873.11 -181.72 -25.91 141.77 2040.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.51180 65.00006 1.408 0.160382
## AtBat -1.86859 0.52742 -3.543 0.000470 ***
## Hits 7.60440 1.66254 4.574 7.46e-06 ***
## Walks 3.69765 1.21036 3.055 0.002488 **
## CRBI 0.64302 0.06443 9.979 < 2e-16 ***
## DivisionW -122.95153 39.82029 -3.088 0.002239 **
## PutOuts 0.26431 0.07477 3.535 0.000484 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 319.9 on 256 degrees of freedom
## Multiple R-squared: 0.5087, Adjusted R-squared: 0.4972
## F-statistic: 44.18 on 6 and 256 DF, p-value: < 2.2e-16
2. Forward Stepwise Selection
Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all \(2^p\) possible models containing subsets of the \(p\) predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in the algorithm below.
Unlike best subset selection, which involved fitting \(2^p\) models, forward stepwise selection involves fitting one null model, along with \(p-k\) models in the \(k\) th iteration, for \(k=0,1, \ldots, p-1\). This amounts to a total of \[ 1+\sum_{k=0}^{p-1}(p-k)=1+\frac{p(p+1)}{2} \] models. This is a substantial difference: when \(p=20\), best subset selection requires fitting \(1,048,576\) models, whereas forward stepwise selection requires fitting only 211 models.
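The two counts quoted above are easy to verify numerically for \(p=20\):

```r
p <- 20
2^p                  # models examined by best subset selection: 1048576
1 + p * (p + 1) / 2  # models fit by forward stepwise selection: 211
```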
Forward Stepwise Selection Algorithm
1. Let \(M_0\) denote the null model, which contains no predictors.
2. For \(k=0,1, \ldots, p-1\):
a). Consider all \(p-k\) models that augment the predictors in \(M_k\) with one additional predictor.
b). Choose the best among these \(p-k\) models, and call it \(M_{k+1}\). Here best is defined as having the smallest RSS or highest \(R^2\).
3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
In Step 2(b) of the algorithm, we must identify the best model from among those \(p-k\) that augment \(M_k\) with one additional predictor. We can do this by simply choosing the model with the lowest RSS or the highest \(R^2\). However, in Step 3, we must identify the best model among a set of models with different numbers of variables. This is more challenging and will be discussed later. Forward stepwise selection’s computational advantage over best subset selection is clear. Though forward stepwise tends to do well in practice, it is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors. For instance, suppose that in a given data set with \(p=3\) predictors, the best possible one-variable model contains \(X_1\), and the best possible two-variable model instead contains \(X_2\) and \(X_3\). Then forward stepwise selection will fail to select the best possible two-variable model, because \(M_1\) will contain \(X_1\), so \(M_2\) must also contain \(X_1\) together with one additional variable.
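This failure mode is easy to reproduce on a constructed toy data set (our own simulated example, using plain lm() fits rather than a selection package): below, y is an exact function of x2 and x3, but x1, which is approximately their sum, is individually the best single predictor, so forward stepwise picks x1 first and can never reach the perfect {x2, x3} model.

```r
set.seed(1)
n <- 50
x2 <- rnorm(n); x3 <- rnorm(n)
x1 <- x2 + x3 + rnorm(n, sd = 0.1)   # nearly the sum of the other two
y  <- x2 + x3                        # exact function of x2 and x3

rss <- function(form) sum(resid(lm(form))^2)
# Step 1 of forward stepwise: x1 gives by far the smallest RSS
which.min(c(rss(y ~ x1), rss(y ~ x2), rss(y ~ x3)))
# Yet the best possible two-variable model, {x2, x3}, fits perfectly
rss(y ~ x2 + x3)   # essentially zero
```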
Forward stepwise selection can be applied even in the high-dimensional setting where \(n<p\); however, in this case, it is possible to construct sub-models \(M_0, \ldots, M_{n-1}\) only, since each sub-model is fit using least squares, which will not yield a unique solution if \(p \geq n\).
Example: Hitters data set (ctd.)
Using the argument method="forward":
regfit.fwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
We see that using forward stepwise selection, the best one-variable
model contains only CRBI, and the best two-variable model
additionally includes Hits.
3. Backward Stepwise Selection
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
Example: Hitters data set (ctd.)
Using the argument method="backward":
regfit.bwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*" " "
## 5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " " " "
## 5 ( 1 ) " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Backward Stepwise Selection Algorithm
1. Let \(M_p\) denote the full model, which contains all \(p\) predictors.
2. For \(k=p, p-1, \ldots, 1\):
a). Consider all \(k\) models that contain all but one of the predictors in \(M_k\), for a total of \(k-1\) predictors.
b). Choose the best among these \(k\) models, and call it \(M_{k-1}\). Here best is defined as having the smallest RSS or highest \(R^2\).
3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\).
4. Hybrid Approaches
The best subset, forward stepwise, and backward stepwise selection approaches generally give similar but not identical models. As another alternative, hybrid versions of forward and backward stepwise selection are available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.
Example: Hitters data set (ctd.)
Using the argument method="seqrep":
regfit.seqrep = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "seqrep")
summary(regfit.seqrep)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "seqrep")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: 'sequential replacement'
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" "*" "*" "*" " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
The four approaches can, however, produce different best seven-variable models. Compare the seven-variable coefficient estimates from best subset, forward stepwise, backward stepwise, and sequential replacement selection:
## (Intercept) Hits Walks CAtBat CHits CHmRun
## 79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
## DivisionW PutOuts
## -129.9866432 0.2366813
## (Intercept) AtBat Hits Walks CRBI CWalks
## 109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070
## DivisionW PutOuts
## -127.1223928 0.2533404
## (Intercept) AtBat Hits Walks CRuns CWalks
## 105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095 -0.7163346
## DivisionW PutOuts
## -116.1692169 0.3028847
Therefore, depending on the selection technique you use, you will obtain different models.
The Trade-offs between Prediction Accuracy and Model Interpretability
A representation of the trade-off between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases.
One might reasonably ask the following question: why would we ever choose to use a more restrictive method instead of a very flexible approach?
There are several reasons that we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between \(Y\) and \(X_1,X_2,…,X_p\). In contrast, very flexible approaches, such as splines and boosting, can lead to such complicated estimates of \(f\) that it is difficult to understand how any individual predictor is associated with the response.
Reducible and Irreducible errors
Consider a given estimate \(\hat{f}\) and a set of predictors \(X\), which yields the prediction \(\hat{Y}=\hat{f}(X)\). Assume for a moment that both \(\hat{f}\) and \(X\) are fixed. Then, it is easy to show that:
\[E(Y-\hat{Y})^2=\underbrace{[f(X)-\hat{f}(X)]^2}_{\text{Reducible}}+\underbrace{\operatorname{Var}(\epsilon)}_{\text{Irreducible}}\]
where \(E(Y-\hat{Y})^2\) represents the average, or expected value, of the squared difference between the predicted and actual value of \(Y\), and \(\operatorname{Var}(\epsilon)\) represents the variance associated with the error term \(\epsilon\).
Why is the irreducible error larger than zero?
The quantity \(\epsilon\) may contain unmeasured variables that are useful in predicting Y : since we don’t measure them, f cannot use them for its prediction.
The quantity \(\epsilon\) may also contain un-measurable variation.
For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient’s general feeling of well-being on that day.
The focus of this lesson is on techniques for estimating \(f\) with the aim of minimizing the reducible error.
It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for \(Y\).
This bound is almost always unknown in practice.
Training error Vs Testing Error
The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.
In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training.
But the training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter.
Bias - Variance Trade-off
Earlier we derived that the expected prediction error can be divided into two parts: the reducible error and the irreducible error.
\[\begin{aligned} E(Y-\hat{Y})^2 & =E[f(X)+\epsilon-\hat{f}(X)]^2 \\ & ={[f(X)-\hat{f}(X)]^2}+{\operatorname{Var}(\epsilon)} \end{aligned}\]
Furthermore, the reducible error can itself be decomposed into the variance and the squared bias of the fitted model.
\[E\left(Y-\hat{Y}\right)^2=\operatorname{Var}\left(\hat{f}\left(X\right)\right)+\left[\operatorname{Bias}\left(\hat{f}\left(X\right)\right)\right]^2+\operatorname{Var}(\epsilon)\]
What is Bias?
The difference between the average model prediction and the ground truth.
\[ Bias = E(\hat{f}(X)) - f(X)\]
The bias of the estimated function tells us how well, on average, the model is able to capture the true relationship between the predictors and the response.
What is Variance?
The average variability of the model's predictions across different training datasets.
\[ Variance = E\left[\left(\hat{f}(X)-E[\hat{f}(X)]\right)^2\right] \]
The variance of the estimated function tells you how much \(\hat{f}\) would change if it were estimated on a different training dataset.
The Trade-off:
Typically as the flexibility of the model increases, its variance increases, and its bias decreases. So choosing the flexibility based on average test error amounts to a “Bias-Variance Trade-off”.
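To make the trade-off concrete, here is a small simulation (a hypothetical illustration, not from the text): we generate data from a known \(f\), fit polynomials of increasing degree, and watch the test MSE trade bias against variance.

```r
set.seed(1)
f = function(x) sin(2 * pi * x)   # the "true" f, unknown in practice
n = 100
x = runif(n); y = f(x) + rnorm(n, sd = 0.3)
x.test = runif(1000); y.test = f(x.test) + rnorm(1000, sd = 0.3)

# Test MSE for polynomial fits of increasing flexibility (degree)
test.mse = sapply(1:10, function(d) {
  fit = lm(y ~ poly(x, d))
  mean((y.test - predict(fit, data.frame(x = x.test)))^2)
})
round(test.mse, 3)
```

Low degrees underfit (high bias), very high degrees chase the noise (high variance); the minimum of the resulting curve is the sweet spot of the trade-off.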
Some other Trade-offs:
Prediction accuracy Vs Interpretability:
- Linear models are easy to interpret; thin-plate splines are not.
Good fit Vs Over-fit or Under-fit.
- How do we know when the fit is just right?
Parsimony Vs Black-box
- We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
Resampling Methods
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive.
In this section, we discuss one of the most commonly used resampling methods, cross-validation. This is an important tool in the practical application of many statistical learning procedures.
Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.
In the past, performing cross-validation was computationally prohibitive for many problems with large \(p\) and/or large \(n\), and so \(AIC\), \(BIC\), \(C_p\), and adjusted \(R^2\) were more attractive approaches for choosing among a set of models. However, nowadays with fast computers, the computations required to perform cross-validation are hardly ever an issue. Thus, cross validation is a very attractive approach for selecting from among a number of models under consideration.
The Validation Set Approach
Suppose that we would like to estimate the test error associated with fitting a particular statistical learning method on a set of observations. The validation set approach is a very simple strategy for this task.
It involves randomly dividing the available set of observations into two parts, a training set and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
The resulting validation set error rate, typically assessed using MSE in the case of a quantitative response, provides an estimate of the test error rate.
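As a minimal self-contained sketch of this procedure (using the built-in mtcars data rather than the text's example):

```r
set.seed(1)
# Randomly split the observations into a training half and a validation half
train.id = sample(nrow(mtcars), nrow(mtcars) / 2)

# Fit the model on the training set only
fit = lm(mpg ~ wt + hp, data = mtcars, subset = train.id)

# The validation-set MSE is our estimate of the test error
val.mse = mean((mtcars$mpg[-train.id] - predict(fit, mtcars[-train.id, ]))^2)
val.mse
```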
Example: Hitters data set (continued)
For this example, let’s try to conduct a validation set approach.
We will use best subset regression to obtain 19 models with different numbers of variables, and use the validation set approach to figure out the best model among them.
Since the data splitting is random, we will set a seed value for reproducibility.

set.seed(1) # reproducibility
# data splitting: 50% / 50%
train = sample(c(TRUE, FALSE), nrow(Hitters), rep = TRUE)
test = (!train)

Now apply best subset selection to the training set.
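The call that produces regfit.best is not echoed in this document; based on the later use of coef(regfit.best, id = i), it would presumably be along these lines:

```r
library(leaps)
# Best subset selection on the training observations only,
# considering models with up to all 19 predictors
regfit.best = regsubsets(Salary ~ ., data = Hitters[train, ], nvmax = 19)
```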
Next, make a model matrix from the test data. The model.matrix() function is used in many regression packages for building an X matrix from data.
Now we run a loop, and for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE. To store the test MSE values from all 19 models, let's create a placeholder vector:
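The corresponding code is not echoed; it would presumably be:

```r
# X matrix built from the held-out (test) observations
test.mat = model.matrix(Salary ~ ., data = Hitters[test, ])

# Placeholder for the 19 test-MSE values, one per model size
val.errors = rep(NA, 19)
```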
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
There is no predict() function for regsubsets(). Therefore, we need to calculate the predicted values manually according to the following formula for all 19 different models:
\[ \hat{Y} = X\hat\beta\]
for(i in 1:19){
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((Hitters$Salary[test] - pred)^2)
}
# MSE values for all 19 models.
val.errors
## [1] 164377.3 144405.5 152175.7 145198.4 137902.1 139175.7 126849.0 136191.4
## [9] 132889.6 135434.9 136963.3 140694.9 140690.9 141951.2 141508.2 142164.4
## [17] 141767.4 142339.6 142238.2
Next let’s find the model which has the lowest MSE value.
## [1] 7
According to the lowest MSE value, the model with 7 variables is the best model.
Following are the coefficients for the model.
## (Intercept) AtBat Hits Walks CRuns CWalks
## 67.1085369 -2.1462987 7.0149547 8.0716640 1.2425113 -0.8337844
## DivisionW PutOuts
## -118.4364998 0.2526925
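Note that these coefficients differ from the seven-variable model fit on the training half alone; they presumably come from refitting best subset selection on the full data set once the model size has been chosen, e.g.:

```r
# Refit on all observations, then extract the best seven-variable model
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
coef(regfit.full, 7)
```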
Drawbacks of validation set approach
The validation set approach is conceptually simple and is easy to implement. But it has two potential drawbacks:
The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validation set—are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
k-Fold Cross-Validation
This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining \(k-1\) folds. The mean squared error, \(\mathrm{MSE}_1\), is then computed on the observations in the held-out fold. This procedure is repeated \(k\) times; each time, a different group of observations is treated as a validation set. This process results in \(k\) estimates of the test error, \(\operatorname{MSE}_1, \mathrm{MSE}_2, \ldots, \mathrm{MSE}_k\). The \(k\)-fold CV estimate is computed by averaging these values, \[ \mathrm{CV}_{(k)}=\frac{1}{k} \sum_{i=1}^k \operatorname{MSE}_i\]
In practice, one typically performs k-fold CV using \(k = 5\) or \(k = 10\).
Example: Hitters data set (continued)
Important: since there is no predict() function for regsubsets(), we can capture our steps above and write our own predict method, which we will use again below.
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  xvars = names(coefi)
  mat[, xvars] %*% coefi
}

Similar to the previous example, let's use 10-fold cross-validation to select the best model among the 19 candidate models.
k = 10 # 10-fold cross-validation
set.seed(1)
# creating the folds
(folds = sample(1:k, nrow(Hitters), replace = TRUE))
## [1] 9 4 7 1 2 7 2 3 1 5 5 10 6 10 7 9 5 5 9 9 5 5 2 10 9
## [26] 1 4 3 6 10 10 6 4 4 10 9 7 6 9 8 9 7 8 6 10 7 3 10 6 8
## [51] 2 2 6 6 1 3 3 8 6 7 6 8 7 1 4 8 9 9 7 4 7 6 1 5 6
## [76] 1 9 7 7 3 6 2 10 10 7 3 2 10 1 10 10 8 10 5 7 8 5 6 8 1
## [101] 3 10 3 1 6 6 4 9 5 1 3 6 3 7 3 3 1 9 2 8 6 1 2 7 7
## [126] 4 9 8 3 5 3 4 2 1 7 9 10 10 2 2 3 1 2 3 3 3 8 9 2 10
## [151] 8 10 4 5 9 5 7 5 6 4 2 1 3 8 9 6 1 4 5 9 5 8 4 1 9
## [176] 5 1 5 4 10 10 9 8 5 5 6 6 2 2 8 4 10 8 5 5 8 8 7 4 4
## [201] 1 10 4 9 9 9 9 6 6 4 3 3 9 9 7 9 5 7 4 4 10 8 1 10 2
## [226] 10 1 1 4 5 5 6 9 8 5 1 2 1 8 5 8 10 7 7 2 9 4 2 5 2
## [251] 4 3 6 9 7 5 5 1 1 10 1 3 10
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## [1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [7,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [8,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [9,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [10,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
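The 10×19 NA matrix shown above is a placeholder for the fold-by-model-size test MSEs; it would presumably be created by:

```r
# Rows index the k = 10 folds, columns index model size (1-19 variables)
cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
```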
Now let's write a for loop that performs the cross-validation:
for(j in 1:k){
  best.fit = regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19)
  for(i in 1:19){
    pred = predict(best.fit, Hitters[folds == j, ], id = i)
    cv.errors[j, i] = mean((Hitters$Salary[folds == j] - pred)^2)
  }
}

This gives us a 10×19 matrix whose \((j,i)^{th}\) element is the test MSE for the \(j^{th}\) cross-validation fold of the best \(i\)-variable model.
## 10
## 10
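The code producing the two values of 10 above is not shown; presumably the fold-wise errors are averaged over model sizes and the minimizer extracted:

```r
# Average the 10 fold MSEs for each model size, then find the minimum
mean.cv.errors = apply(cv.errors, 2, mean)
which.min(mean.cv.errors)  # cross-validation selects the 10-variable model here
```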
We now perform best subset selection on the full data set to obtain the 10-variable model.
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
Leave-one out cross-validation (LOOCV)
Setting \(k=n\) yields \(n\)-fold or leave-one-out cross-validation (LOOCV).
LOOCV has the potential to be expensive to implement, since the model has to be fit n times. This can be very time consuming if n is large, and if each individual model is slow to fit. With least squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:
\[\mathrm{CV}_{(n)}=\frac{1}{n} \sum_{i=1}^n\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2\]
where \(\hat{y}_i\) is the \(i\)th fitted value from the original least squares fit, and \(h_i\) is the leverage of the \(i\)th observation.
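For least squares, this shortcut can be checked directly in R: hatvalues() returns the leverages \(h_i\), so LOOCV needs only a single fit. A self-contained sketch on the built-in mtcars data:

```r
fit = lm(mpg ~ wt, data = mtcars)

# LOOCV via the leverage shortcut: one fit stands in for n refits
h = hatvalues(fit)
cv.n = mean((residuals(fit) / (1 - h))^2)

# Sanity check: the explicit leave-one-out loop gives the same value
cv.loop = mean(sapply(1:nrow(mtcars), function(i) {
  fit.i = lm(mpg ~ wt, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit.i, mtcars[i, ]))^2
}))
all.equal(cv.n, cv.loop)  # TRUE
```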
Cross-validation is a very general approach that can be applied to almost any statistical learning method. Some statistical learning methods have computationally intensive fitting procedures, and so performing LOOCV may pose computational problems, especially if \(n\) is extremely large.
Cross-Validation: Using Caret Package
This is another example of using cross-validation, but instead of the leaps package, here we use the caret package.
For this example we use the stackloss data set.
# Specifying the cross-validation method
train_control = trainControl(method = "CV", number = 5)

# Fit a regression model and use k-fold CV to evaluate performance
model1 = train(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
               data = data1, method = "lm",
               trControl = train_control)

# View summary of k-fold CV
print(model1)
## Linear Regression
##
## 21 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 17, 17, 18, 16, 16
## Resampling results:
##
## RMSE Rsquared MAE
## 3.746382 0.9350616 3.126287
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Another model:
model2 = train(stack.loss ~ Air.Flow + I(Air.Flow^2) + Water.Temp + I(Water.Temp^2) +
                 Acid.Conc. + I(Acid.Conc.^2),
               data = data1, method = "lm",
               trControl = train_control)
print(model2)
## Linear Regression
##
## 21 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 18, 16, 17, 16, 17
## Resampling results:
##
## RMSE Rsquared MAE
## 3.92868 0.9405023 3.103545
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) Air.Flow `I(Air.Flow^2)` Water.Temp
## -52.516836 1.448626 -0.007464 -9.745186
## `I(Water.Temp^2)` Acid.Conc. `I(Acid.Conc.^2)`
## 0.264509 2.448413 -0.015464
## RMSE Rsquared MAE Resample
## 1 1.824413 0.9760729 1.334178 Fold1
## 2 4.286561 0.8936954 3.467590 Fold2
## 3 3.753380 0.9937847 2.830641 Fold3
## 4 5.484813 0.8434885 4.145861 Fold4
## 5 4.294233 0.9954699 3.739455 Fold5
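The per-fold table above is not produced by print(); it presumably comes from the resample field that caret stores on the fitted object:

```r
# Per-fold RMSE, R-squared and MAE for the five CV folds
model2$resample
```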
Here is how to interpret the output:
No pre-processing occurred. That is, we didn’t scale the data in any way before fitting the models.
The resampling method we used to evaluate the model was cross-validation with 5 folds.
The sample size for each training set was given.
RMSE: The root mean squared error. This measures the average difference between the predictions made by the model and the actual observations. The lower the RMSE, the more closely a model can predict the actual observations.
Rsquared: This is a measure of the correlation between the predictions made by the model and the actual observations. The higher the R-squared, the more closely a model can predict the actual observations.
MAE: The mean absolute error. This is the average absolute difference between the predictions made by the model and the actual observations. The lower the MAE, the more closely a model can predict the actual observations.
Each of the three metrics provided in the output (RMSE, R-squared, and MAE) gives us an idea of how well the model performed on previously unseen data.
In practice we typically fit several different models and compare the three metrics provided by the output seen here to decide which model produces the lowest test error rates and is therefore the best model to use.